How to analyse scanem output

How to analyse scanem output

Introduction

This vignette focusses on further analysing output from scanem. After running scanem, it outputs a couple of different types of information:

  1. The weights of the convolutional kernels, or ‘motif detectors’, which are comparable to PWMs
  2. The alignments of convolutional kernels to a database of motifs, with significance (q-)scores for each alignment
  3. The weights of the neural network, weighing the motif impacts in each pool

The final part of scanem trains 10 different neural networks, each using a different 90% of the dataset as the training data (and 10% of the dataset as the test data). Each model contains 300 motif detectors, so there are a total of 3000 motifs. One of the goals of scanem is to identify motifs and motif clusters that show up repeatedly across the 10 models - we set a threshold of motif clusters that show up in at least half of the models.

The motifs are named as MODELNUMBER_MOTIFNUMBER, e.g. motif 2_3 is motif 3 from model 2. The best performing out of 10 models is named “best” rather than its model number, e.g. motif best_129.

All of the motifs are aligned with Tomtom, the motif comparison tool from the MEME Suite, against a database of known motifs. This generates a tab-separated file containing what motifs aligned to what reference motifs, and the q-values for the alignments. One of the things to do here is to generate a bipartite graph of the model motifs and the database motifs. Then it becomes possible to select reproducible motifs. For a visual overview, see the figure below.

One of the outputs from the model is a HDF5 file containing motif detector and neural network weights for each model. The neural network weights are weights for each motif detector in each pool - the network uses these to “weigh” the importance of motifs in the different pools.

Below you can see a visual overview of the workflow for motif analysis using scanem. This workflow will start at the third step, where we have the weights from the models and the motif alignments from Tomtom.

scanem_output_workflow

Loading the required packages and helper functions

During this workflow, I’ll use the following packages: (see end of the page for session info):

suppressMessages(library(dplyr))
suppressMessages(library(ggplot2))
suppressMessages(library(igraph))
suppressMessages(library(mixtools))
suppressMessages(library(rhdf5))
suppressMessages(library(SingleCellExperiment))
suppressMessages(library(stringr))
suppressMessages(library(xtable))
suppressMessages(library(ggrepel))
suppressMessages(library(pheatmap))
suppressMessages(library(magrittr))
suppressMessages(library(reshape2))

options(stringsAsFactors = F)

I’ve added some custom functions to the following file:

source("scanem_helper_functions.R")

Set up paths to files

For the rest of the analysis, I am using paths to the network output (exp_folder), the input dataset (data_path), and the cell type annotation (colData_path). I also need to input the motif database that was used for motif alignment (motif_db_path). I will also create an output directory for the analysis.

# Network output dir
exp_folder <- "~/jh47/maydev_2/output/tm_7organs_pool80_5u5d/"

# Network input datasets
data_path <- "~/jh47/maydev_2/data/20200901_tm_7organs_pool80_5u5d.tsv.gz"
colData_path <- "~/jh47/maydev_2/data/20200901_tm_7organs_pool80_5u5d_colData.tsv"

# Motif alignment db
motif_db_path <- "~/jh47/maydev_2/resources/Mus_musculus.meme"

# Output directory
outdir <- "20200909_test_out"
dir.create(outdir)
## Warning in dir.create(outdir): '20200909_test_out' already exists

Reading the dataset, alignments, and network output

Using the paths defined above, I can now use the functions from scanem_helper_functions.R to read the different parts for downstream analysis. Firstly, I will construct a SingleCellExperiment object using the pooled dataset, read the tomtom output and filter it to include only the genes that are expressed in the dataset, and I will read the weights matrix and scale it.

For the latter step, I will scale the weights in the neural network layer by the maximum possible motif activation. This way, it will more accurately reflect the impact of a motif hit in a particular pool. This is all done using the read_network_hdf5() function.

sce <- scanem_dataset_to_sce(data_path = data_path,
                      colData_path = colData_path)

tomtom_table <- read_tomtom_output(motif_db_path = motif_db_path,
                                    exp_folder = exp_folder)

tomtom_table <- filter_tomtom_table(tomtom_table = tomtom_table, sce = sce)

weights_matrix <- read_network_hdf5(exp_folder, sce)

Motif alignment graph

I will now use the tomtom alignments to construct a bipartite graph using the Tomtom alignments. After this, I use igraph::cluster_walktrap() to cluster these alignments, and I use plot_alignment_graph() to generate two files in the outdir; a motif alignment plot with/without cluster annotation. These plots are quite messy but give a good overview of how many different motif clusters are found.

alignment_graph <- create_alignment_graph(tomtom_table)

# Cluster alignment graph
alignment_graph_clusters <- igraph::cluster_walktrap(alignment_graph)
alignment_graph_cluster_groups <- igraph::groups(alignment_graph_clusters)

# Plot two alignment graphs
plot_alignment_graph(outdir, alignment_graph, alignment_graph_clusters)

To get the motif clusters into a dataframe, I use a custom function:

cluster_df <- construct_cluster_df(exp_folder = exp_folder,
                                   alignment_graph_cluster_groups =
                                     alignment_graph_cluster_groups)

I also want to construct a matrix where the motif weights are averaged per cell type:

average_weights_matrix <- construct_average_weights_matrix(weights_matrix)

I use this function to get information about the motifs in this matrix. This will also plot the motif annotation as “good” or “bad”:

motif_stats <- get_motif_stats(cluster_df = cluster_df,
                               average_weights_matrix = average_weights_matrix,
                               pseudocount = 1e-11)
## number of iterations= 3
## Some 'dead motifs' found

The motif “impact” is how much a motif will affect the expression of cells In many runs you will encounter something like this, where a subgroup of motifs has a way lower impact than the others. (I suspect this is because of “dead ReLUs” in the network). The previous function has annotated the lower impact motifs as “bad”. (If this function does split good/bad motifs well for you, it might be worth changing the pseudocounts option of this function to a lower value).

The get_motif_stats() function will also create a preliminary motif cluster annotation to show what kind of motifs are included in the cluster. However, this will need to be refined (which we will get to later).

For further analysis, I will only consider motifs from motif clusters that show up in >= 50% of the models, that aligned to known motifs, and that are annotated as “good”.

Let’s filter the values so that it only keeps the good motifs:

# Get names of good motifs: the ones that have high impact, are from
# reproducible motif clusters, and are aligned to known motifs
good_motifs <- rownames(motif_stats)[motif_stats$is_good_motif=="Yes" &
                                       motif_stats$cluster_reprod >= 0.5 &
                                       motif_stats$cluster_annot != "NA"]

weights_matrix <- weights_matrix[good_motifs,]
motif_stats <- motif_stats[good_motifs,]
average_weights_matrix <- average_weights_matrix[good_motifs,]

Plotting the motif weights across cell types

If we were to plot the average weights of the motifs across the cell types, it would look something like this:

# Only use this function if you have 10 or fewer motif clusters at this point.
annotation_colors <- get_annotation_colors(motif_stats)

# The "drop = FALSE" will make sure the annotation stays a dataframe
pheatmap(average_weights_matrix,
         annotation_row = motif_stats[,c("cluster_annot"), drop = FALSE],
         show_rownames = FALSE, angle_col = 45,
         annotation_colors = annotation_colors,
         cellwidth = 8,
         cellheight = .5)

Here, every row corresponds to a motif detector and every column corresponds to a cell type.

As you can see, a small subset of the motifs has the largest impact. In this case, this seems to be the Gabpa/Erg/Etv5/... cluster (ETS family motifs). One other thing to note is that many of the motif clusters do not cluster together all too well.

If we instead plot Z-transformed motif weights, it looks like this:

# Use custom function to get
average_weights_matrix_z_scores <- to_z(average_weights_matrix)

pheatmap(average_weights_matrix_z_scores,
         annotation_row=motif_stats[,c("cluster_annot"), drop=FALSE],
         show_rownames = FALSE, angle_col = 45,
         annotation_colors = annotation_colors,
         cellwidth = 8,
         cellheight = .5)

Already you can see that the motif clusters cluster together much better. Do note that the relative impacts of motifs in a single cell type cannot be deduced from this heatmap, as the values have been transformed per row. What this does show is the general ‘trend’ of the weights of a motif across the cell types.

Z-scores can, however, inflate the differences between cell types. This is why it is beneficial to have another score that tells you something about how “cell-type-specific” a motif weight really is. For this, the “motif cell type entropy” score has been calculated for each motif, which is the normalised Shannon entropy of a motif weight across cell types. The lower this score is, the further from a uniform distribution the weight distribution is.

Let’s plot the motifs along with the entropy scores. I will also order the motifs for their motif cluster annotation:

pheatmap(average_weights_matrix_z_scores[order(motif_stats$cluster_annot),],
         annotation_row=motif_stats[,c("cluster_annot",
                                       "motif_celltype_entropy")],
         show_rownames = FALSE, angle_col = 45,
         annotation_colors = annotation_colors,
         cluster_rows = FALSE,
         cellwidth = 8,
         cellheight = .5)

Now, the added motif_celltype_entropy annotation in white/green will tell you how cell-type-specific the weights are (whiter = more specific). As you can see, the E2f1 cluster is particularly specific, the Klf6/Sp2/Sp3... cluster has quite mixed scores, and some other clusters, such as Zfp143/Smarcc2..., are less specific.

Refining motif cluster annotations

The current motif cluster annotation is very unspecific. It would be nice to identify which of the transcription factors (TFs) from this list is the one that’s the most likely candidate. However, this is one of the more manual steps in this analysis.

To get more information on the TF candidates in the motif cluster, we can exploit the fact that we have expression information for the different TFs. One thing we can then do is look at the correlations of the expression of the TFs in the motif cluster and the pattern of the motif weights across cell types.

The following code will produce plots in the outdir that show the expression pattern for the different cluster candidates, including a Spearman correlation score for the different cluster candidate TFs.

In addition to this, it will generate directories per motif cluster that contain the different plots that led to these correlation scores. It will also return a DataFrame that contains the

cluster_correlations_df <- plot_candidate_tf_information(cluster_df = cluster_df,
                              weights_matrix = weights_matrix,
                              motif_stats = motif_stats,
                              sce = sce,
                              tomtom_table = tomtom_table)

Let’s look at an example, where the cluster with Yy1 and Mbtps2 is analysed:

scanem_output_workflow

Here, we can see that Yy1 has a higher correlation to the motif weights (green) and more Tomtom alignments (purple) than Mbtps2, making it the more likely candidate.

Let’s take a look at the correlation plots for these two:

Yy1 Mbtps2

Here, the cell types are annotated by color. There is an alternative plot generated by the earlier function that adds cell type labels, but those plots can get quite busy.

Using this information, we can update motif cluster annotations. Sometimes it’s hard to narrow it down to the exact transcription factor, in which case I annotate the cluster with the motif family names. Sometimes it can be good to look at individual motifs in the motif cluster and look directly at the alignments in Tomtom; for this, see the all_tomtom/tomtom.html webpage generated by Tomtom.

I do this using the following code (I look at View(cluster_df[cluster_df$cluster_reproducibility >= 0.5,]) to see which number corresponds to which cluster):

cluster_df$cluster_annot[rownames(cluster_df) == "4"] <- "Egr1/Wt1/Klf family motifs"
cluster_df$cluster_annot[rownames(cluster_df) == "7"] <- "ETS family motifs"
cluster_df$cluster_annot[rownames(cluster_df) == "16"] <- "Yy1"
cluster_df$cluster_annot[rownames(cluster_df) == "2"] <- "bHLH family motifs"
cluster_df$cluster_annot[rownames(cluster_df) == "6"] <- "bHLH family motifs 2"
cluster_df$cluster_annot[rownames(cluster_df) == "10"] <- "Zfp143"

# Update the annotation
motif_stats <- get_motif_stats(cluster_df = cluster_df,
                               average_weights_matrix = average_weights_matrix,
                               pseudocount = 1e-11)
## number of iterations= 102
## No clear bimodal distribution: likely not a lot of 'dead motifs'

annotation_colors <- get_annotation_colors(motif_stats)

pheatmap(average_weights_matrix_z_scores[order(motif_stats$cluster_annot),],
         annotation_row=motif_stats[,c("cluster_annot",
                                       "motif_celltype_entropy")],
         show_rownames = FALSE, angle_col = 45,
         annotation_colors = annotation_colors,
         cluster_rows = FALSE,
         cellwidth = 8,
         cellheight = .5)

Plotting all motif family correlations

Using the following function, you can plot the correlations and expression values for the different motif families, while annotating the top x hits per motif family:

plot_motif_family_correlation_overview(cluster_correlations_df, top_n = 5)
## Warning: Removed 121 rows containing missing values (geom_label_repel).

Plotting motif family weights across cell types

It might make more sense to look at the motif weights individually to assess how the weights vary across cell types. For this I use the following function:

plot_motif_family_weights_across_celltypes(weights_matrix = weights_matrix,
                                           sce = sce,
                                           motif_stats = motif_stats,
                                           motif_family = "ETS family motifs")

(If this plot returns a blank plot for you, the motif_family you specified is probably not spelled right. You can check unique(motif_stats$cluster_annot) to see which motif families to choose from)

As you can see here, the ETS family motifs are higher in immune cells (on the right) and lowest in neurons (on the left).

Another example:

plot_motif_family_weights_across_celltypes(weights_matrix = weights_matrix,
                                           sce = sce,
                                           motif_stats = motif_stats,
                                           motif_family = "Egr1/Wt1/Klf family motifs")

sessionInfo()

sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
##
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas/liblapack.so.3
##
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets
## [8] methods   base
##
## other attached packages:
##  [1] reshape2_1.4.4              magrittr_1.5
##  [3] pheatmap_1.0.12             ggrepel_0.8.2
##  [5] xtable_1.8-4                stringr_1.4.0
##  [7] SingleCellExperiment_1.8.0  SummarizedExperiment_1.16.1
##  [9] DelayedArray_0.12.3         BiocParallel_1.20.1
## [11] matrixStats_0.56.0          Biobase_2.46.0
## [13] GenomicRanges_1.38.0        GenomeInfoDb_1.22.1
## [15] IRanges_2.20.2              S4Vectors_0.24.4
## [17] BiocGenerics_0.32.0         rhdf5_2.30.1
## [19] mixtools_1.2.0              igraph_1.2.5
## [21] ggplot2_3.3.1               dplyr_1.0.0
##
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.4.6           lattice_0.20-40        digest_0.6.25
##  [4] R6_2.4.1               plyr_1.8.6             evaluate_0.14
##  [7] pillar_1.4.4           zlibbioc_1.32.0        rlang_0.4.6
## [10] kernlab_0.9-29         Matrix_1.2-18          rmarkdown_2.2
## [13] labeling_0.3           splines_3.6.3          RCurl_1.98-1.2
## [16] munsell_0.5.0          compiler_3.6.3         xfun_0.14
## [19] pkgconfig_2.0.3        segmented_1.1-0        htmltools_0.4.0
## [22] tidyselect_1.1.0       gridExtra_2.3          tibble_3.0.1
## [25] GenomeInfoDbData_1.2.2 viridisLite_0.3.0      crayon_1.3.4
## [28] withr_2.2.0            MASS_7.3-51.5          bitops_1.0-6
## [31] grid_3.6.3             gtable_0.3.0           lifecycle_0.2.0
## [34] scales_1.1.1           stringi_1.4.6          farver_2.0.3
## [37] XVector_0.26.0         viridis_0.5.1          ellipsis_0.3.1
## [40] generics_0.0.2         vctrs_0.3.1            Rhdf5lib_1.8.0
## [43] RColorBrewer_1.1-2     tools_3.6.3            glue_1.4.1
## [46] purrr_0.3.4            survival_3.1-8         yaml_2.2.1
## [49] colorspace_1.4-1       knitr_1.28